Abstract

We investigate the nature of the alignment with gaps corresponding to a Longest
Common Subsequence (LCS) of two random sequences. We show that such an

alignment, which we call optimal, typically matches pieces of similar length. This 
is of importance in order to understand the structure of optimal alignments. We 
also establish a method for showing that a certain class of properties typically holds 
in most parts of the optimal alignment, under the assumption that the property
considered holds with high probability for strings of similar short length. The
present result is part of our general effort to obtain the order of the variance of the 
LCS of random strings. 



1 Introduction 



Let x and y be two finite strings. A common subsequence of x and y is a subsequence 
which is a subsequence of x as well as of y. A Longest Common Subsequence (LCS) of x 
and y is a common subsequence of x and y of maximal length. 



Throughout, let $X$ and $Y$ be two random strings $X = X_1 \ldots X_n$ and $Y = Y_1 \ldots Y_n$,
and let $LC_n$ denote the length of the LCS of $X$ and $Y$.

As is well known, common subsequences can be represented as alignments with gaps;
this is illustrated next with some examples:

First take the binary strings x = 0010 and y = 0110. A common subsequence is 01. We represent
this common subsequence as an alignment with gaps. We allow only for alignments which align the
same letters or letters with gaps. We represent a common subsequence by aligning the letters of the
subsequences from each word. The letters which do not appear in the common subsequence get aligned
with a gap; several alignments can represent the same common subsequence. In this first example, an
alignment corresponding to the common subsequence 01 is given by

(1.1)



However, the LCS is not 01, but 010. We call an alignment corresponding to the LCS an optimal
alignment. Hence, (1.1) is not an optimal alignment, but an optimal alignment is given by:



x | 0 0 1 - 0
y | 0 - 1 1 0

Here the LCS is LCS(x, y) = 010, and its length is 3, a fact that we denote by

|LCS(x, y)| = 3.
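
As a computational aside, the length of an LCS can be obtained with the classical dynamic program. The sketch below is not part of the original argument; the function names `lcs_table`, `lcs_length` and `one_lcs` are hypothetical helpers introduced only for illustration, and ties in the backtracking step are broken arbitrarily, so `one_lcs` returns just one of possibly several longest common subsequences.

```python
def lcs_table(x, y):
    """Return the table T with T[i][j] = |LCS(x[:i], y[:j])|."""
    rows, cols = len(x) + 1, len(y) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if x[i - 1] == y[j - 1]:
                # Matching letters extend an LCS of the two prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise one of the two last letters is aligned with a gap.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def lcs_length(x, y):
    """Length of a longest common subsequence of x and y."""
    return lcs_table(x, y)[len(x)][len(y)]


def one_lcs(x, y):
    """Backtrack through the table to recover one LCS (ties broken arbitrarily)."""
    table = lcs_table(x, y)
    i, j, letters = len(x), len(y), []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            letters.append(x[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))


print(lcs_length("0010", "0110"), one_lcs("0010", "0110"))
```

For the binary strings of the first example the LCS is unique, so the backtracking has no freedom in the subsequence it returns, only in the alignment that realizes it.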

Let us consider another example: let x = christian and y = krystyan. In this situation, the LCS of x
and y is LCS(x, y) = rstan and the alignment with gaps representing the LCS is:

x | c h - r i - s t i - a n
y | - - k r - y s t - y a n      (1.2)



Again, all the letters which are part of the LCS get aligned with each other, while the other letters 
get aligned with gaps. Often, we say that, in a given alignment, a part of x gets aligned with a part 
of y, and here is what is meant: In the above example, $x_8 x_9$ = an is aligned with $y_7 y_8$ = an, while
$x_5 x_6 x_7 x_8 x_9$ = stian is aligned (with gaps) with $y_4 y_5 y_6 y_7 y_8$ = styan. In this situation, we say that [5, 9]
is aligned with [4, 8]. (We will also sometimes say that [5, 9] gets mapped onto [4, 8] by the alignment we
consider.) This means that the following two conditions are satisfied:

i) the letters $x_5 x_6 x_7 x_8 x_9$ are all aligned exclusively with gaps or with letters from the string $y_4 y_5 y_6 y_7 y_8$;

ii) the letters from $y_4 y_5 y_6 y_7 y_8$ are all aligned with gaps or with letters from the substring $x_5 x_6 x_7 x_8 x_9$.

For further clarification, note that in the alignment (1.2) the interval [1, 4] is aligned with [1, 2]. In
other words, we say that in an alignment a piece of x gets aligned with a piece of y, if and only if the 
letters from the piece of x which get aligned to letters get only aligned with letters from the piece of y 
and vice versa. 

Longest Common Subsequences (LCS) and Optimal Alignments (OA) are important 
tools used in Computational Biology and Computational Linguistics for string matching, 
e.g., see [TU], [3]. It is known and due to Chvatal and Sankoff [3] that the expected length 
of the LCS divided by $n$ converges to a constant $\gamma$. But even for the simplest distributions,
the exact value of $\gamma$ is not known.

In several special cases (e.g., [5], [6], [7]), the long-standing open problem of finding the order of
the variance of the LCS and the OA has been solved. Such is the case, for example, with
binary sequences with 0 and 1 having very different probabilities from each other. In all
these cases, it turned out that the variance of the LCS is of linear order in the length of
the sequences. This is the order conjectured by Waterman [9], for which Steele [8] has
an upper bound of such order, whilst Alexander [1] determined the speed of convergence.
The most important cases, such as i.i.d. sequences with equiprobable letters,
remain, however, open as far as this order is concerned.

In [6] and [7], for determining the linear order of the variance of the LCS, we used
ad-hoc (somewhat intricate) combinatorial arguments, despite the fact that the situation 
there is less involved than for the general case. In this paper, we show a general method 
to prove certain properties of the optimal alignment, given that typically the property 
holds for alignments of short strings. 

We investigate here the nature of the optimal alignments of random strings, i.e., of the 
alignments corresponding to LCSs. For this, and throughout, we take two independent 
random strings $X = X_1 \ldots X_n$ and $Y = Y_1 \ldots Y_n$ and further assume that $X$ and $Y$ are
both i.i.d. sequences drawn from a finite ordered alphabet. We denote by $LC_n$ the length
of the LCS of $X$ and $Y$.

To do so, we are going to partition $X$ into pieces of length $k$, with $k$ fixed as $n$ goes to infinity,
and prove that typically, in any optimal alignment, most of these pieces get aligned with
pieces of $Y$ of similar length. More precisely, we assume throughout that

$$n = m \cdot k.$$

Assume that the integers

$$r_0 = 0 < r_1 < r_2 < r_3 < \ldots < r_{m-1} < r_m = n \tag{1.3}$$

are such that

$$LC_n = \sum_{i=1}^{m} |LCS(X_{k(i-1)+1} X_{k(i-1)+2} \ldots X_{ki};\; Y_{r_{i-1}+1} Y_{r_{i-1}+2} \ldots Y_{r_i})|. \tag{1.4}$$

The above condition just indicates that there exists an optimal alignment which maps
$[k(i-1)+1, ki]$ onto $[r_{i-1}+1, r_i]$ for all $i = 1, 2, \ldots, m$.
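
To make the role of the cut points concrete, the following sketch evaluates the right-hand side of (1.4) for every admissible vector satisfying (1.3) on the small binary example from the beginning of this section. It is only an illustration under stated assumptions: the names `constrained_score` and `admissible_vectors` are hypothetical, the search is brute force and only feasible for very short strings, and the fact that the best piecewise score equals the LCS length is checked here for this one example only.

```python
from itertools import combinations


def lcs_length(x, y):
    """Length of a longest common subsequence, by the standard dynamic program."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def constrained_score(x, y, k, r):
    """Sum over i of |LCS(i-th length-k piece of x, letters r_{i-1}+1..r_i of y)|."""
    m = len(x) // k
    return sum(
        lcs_length(x[k * (i - 1):k * i], y[r[i - 1]:r[i]])
        for i in range(1, m + 1)
    )


def admissible_vectors(n, m):
    """All integer vectors 0 = r_0 < r_1 < ... < r_m = n, as in (1.3)."""
    for inner in combinations(range(1, n), m - 1):
        yield (0,) + inner + (n,)


x, y, k = "0010", "0110", 2
scores = {r: constrained_score(x, y, k, r)
          for r in admissible_vectors(len(x), len(x) // k)}
best = max(scores.values())
optimal_vectors = [r for r, s in scores.items() if s == best]
print(best, optimal_vectors)
```

Every piecewise score is at most the unconstrained LCS length, since gluing the piecewise alignments together yields a valid alignment of the full strings.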

The first goal of the present paper is to show that for $k$ fixed, and $n$ large enough, any
generic optimal alignment is such that the vast majority of the intervals $[r_{i-1}+1, r_i]$ are
close in length to $k$. Another goal is to show that if a certain property $\mathcal{P}$ holds with high
probability for any optimal alignment of strings of (short) length order $k$, then typically
any optimal alignment has a large proportion of parts of order $k$ having the property $\mathcal{P}$.
This is proven in Section 4.

Let us get back to the first goal of this paper. That is, we will show that with high
probability, if the integers $r_0, r_1, \ldots, r_m$ satisfy (1.3) and (1.4), then most of the lengths
$r_i - r_{i-1}$ are close to $k$.

Of course, we need to quantify what is meant by "close to $k$". To do so, we first need
a definition. For $p > 0$, let

$$\gamma^*(p) := \lim_{n \to \infty} \frac{E[|LCS(X_1 X_2 \ldots X_n;\; Y_1 Y_2 \ldots Y_{pn})|]}{n(1+p)/2}. \tag{1.5}$$

This function $\gamma^*$ is just a new parametrization of the usual function

$$\gamma(q) = \lim_{n \to \infty} \frac{E[|LCS(X_1 X_2 \ldots X_{n-nq};\; Y_1 Y_2 \ldots Y_{n+qn})|]}{n}, \quad q \in [-1, 1],$$

i.e.,

$$\gamma^*(p) = \gamma\left(\frac{p-1}{p+1}\right). \tag{1.6}$$

A subadditivity argument as in Chvatal and Sankoff [3] shows that the above limits do
exist. When $X$ and $Y$ are identically distributed, the function $\gamma$ is symmetric about
the origin, while a further subadditivity argument shows that it is concave and so it reaches
its maximum at $q = 0$. In general, it is not clear whether or not it is strictly concave at
$q = 0$. From simulations it seems almost certain that the function $\gamma^*$ is strictly concave
at $p = 1$. This however might be very difficult to prove. (The LCS problem is a Last
Passage Percolation problem with correlated weights. In general, proving for First/Last
passage percolation that the shape of the wet zone is strictly concave seems difficult and
in many cases has not been done yet.) Note that $q(p) = (p-1)/(p+1) = 1 - 2/(p+1)$
is strictly increasing in $p$ and is equal to 0 for $p = 1$. So, if $\gamma(\cdot)$ is strictly concave at
$q = 0$, then it reaches a strict maximum at that point. In that case, $\gamma^*(\cdot)$ would reach a
strict maximum at $p = 1$. Without the strict concavity of $\gamma(\cdot)$, $p = 1$ would not be the
unique point of maximal value.

Usually, however, there are specific methods for showing that $\gamma^*(p)$ is strictly smaller
than $\gamma^*(1)$ as soon as $p$ is further away from 1 than a given small quantity. Here, $\gamma^*$ is
non-decreasing on $[0, 1]$ and then non-increasing on $[1, \infty)$. The value of $\gamma^*$ at 1 is simply
denoted by $\gamma^* := \gamma(0) = \gamma^*(1)$.
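
For intuition about the rescaled mean curve, one can estimate it crudely by simulation. The sketch below is an assumption-laden illustration and not a method used in this paper: it takes i.i.d. uniform binary letters, a finite length `n` instead of the limit, and a handful of repetitions, so the numbers it produces are rough finite-size estimates only. The name `estimate_gamma_star` is hypothetical.

```python
import random


def lcs_length(x, y):
    """Length of a longest common subsequence, by the standard dynamic program."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def estimate_gamma_star(p, n, reps, rng):
    """Average LCS length of strings of lengths n and about p*n,
    divided by the average of the two lengths (finite-n analogue of (1.5))."""
    len_y = max(1, round(p * n))
    total = 0
    for _ in range(reps):
        x = [rng.randrange(2) for _ in range(n)]
        y = [rng.randrange(2) for _ in range(len_y)]
        total += lcs_length(x, y)
    return (total / reps) / ((n + len_y) / 2)


rng = random.Random(0)
at_one = estimate_gamma_star(1.0, 120, 5, rng)
at_quarter = estimate_gamma_star(0.25, 120, 5, rng)
print(at_one, at_quarter)
```

Since the LCS cannot be longer than the shorter string, the rescaled value at $p$ is at most $2p/(1+p)$ for $p \le 1$, which already forces small values far away from $p = 1$.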

So assume that we know that $p_1$ and $p_2$ are such that

$$\gamma^*(p_1) < \gamma^*(1), \quad \gamma^*(p_2) < \gamma^*(1) \tag{1.7}$$

while

$$0 < p_1 < 1 < p_2. \tag{1.8}$$

The main result of this paper is that if $n$ is large enough (we take $k$ fixed and let $n$ go
to infinity), then typically, for every optimal alignment, most of the intervals
$[r_{i-1}+1, r_i]$ (for $i = 1, 2, \ldots, m$) have their lengths between $kp_1$ and $kp_2$. By most, we
mean that, taking $k$ fixed and $n$ large enough, that proportion typically gets as close
to 100% as we want.



Let $\epsilon > 0$, $p_1 > 0$ and $p_2 > 0$ be constants. Let $A^n_{\epsilon,p_1,p_2}$ be the event that a proportion of
no less than $1 - \epsilon$ of the intervals $[r_{i-1}+1, r_i]$, $i = 1, 2, \ldots, m$, have their length between
$kp_1$ and $kp_2$ for any optimal alignment of $X_1 \ldots X_n$ and $Y_1 \ldots Y_n$. More precisely, let
$A^n_{\epsilon,p_1,p_2}$ be defined as the event that for all integer vectors $(r_0, r_1, \ldots, r_m)$ satisfying (1.3)
and (1.4), we have that

$$\mathrm{card}\left(\{ i \in \{1, 2, \ldots, m\} : kp_1 \le r_i - r_{i-1} \le kp_2 \}\right) \ge (1 - \epsilon)m. \tag{1.9}$$

Theorem 1.1 Let $\epsilon > 0$. Let $p_1 < 1 < p_2$ be such that $\gamma^*(p_1) < \gamma^*(1)$, $\gamma^*(p_2) < \gamma^*(1)$.
Let $k \ge 1$ and let $\delta \in (0, \min(\gamma^*(1) - \gamma^*(p_1), \gamma^*(1) - \gamma^*(p_2)))$. Then,

$$P(A^n_{\epsilon,p_1,p_2}) \ge 1 - \exp\left(-n\left(-\ln(ek)/k + \delta^2\epsilon^2/16\right)\right),$$

for all $n = n(k, \epsilon, \delta)$ large enough.

Before proving the above theorem, let us mention that the results presented here are yet
another step in our attempt at finding the order of the variance of $LC_n$ (see [5], [7], [6]
and the references therein), and they will prove useful towards our ultimate goal ([2]).

2 Proof of the main theorem 

In this section we prove our main Theorem 1.1. To do so, we will need to define a few
things. So far we have looked at the intervals onto which an optimal alignment would map
the intervals $[k(i-1)+1, ki]$; we are now going to take the opposite stand: we
will give non-random integers $r_0 = 0 < r_1 < r_2 < \ldots < r_m = n$ and request that the
alignment aligns $[k(i-1)+1, ki]$ onto $[r_{i-1}+1, r_i]$ for every $i = 1, 2, \ldots, m$. In general,
such an alignment is not optimal and the best score an alignment can reach under the
above constraint is given by:

$$L_n(\vec r) := L_n(r_0, r_1, \ldots, r_m) := \sum_{i=1}^{m} |LCS(X_{k(i-1)+1} X_{k(i-1)+2} \ldots X_{ki};\; Y_{r_{i-1}+1} Y_{r_{i-1}+2} \ldots Y_{r_i})|,$$

where $\vec r = (r_0, r_1, \ldots, r_m)$. Hence, the quantity $L_n(\vec r) = L_n(r_0, r_1, \ldots, r_m)$ represents the
maximum number of aligned identical letter pairs under the constraint that the string
$X_{(i-1)k+1} X_{(i-1)k+2} \ldots X_{ik}$ gets aligned with $Y_{r_{i-1}+1} Y_{r_{i-1}+2} \ldots Y_{r_i}$ for all $i = 1, 2, \ldots, m$.
Note that for non-random $\vec r = (r_0, r_1, \ldots, r_m)$, the partial scores

$$|LCS(X_{(i-1)k+1} X_{(i-1)k+2} \ldots X_{ik};\; Y_{r_{i-1}+1} Y_{r_{i-1}+2} \ldots Y_{r_i})|$$

are independent of each other, and concentration inequalities will prove handy when
dealing with $L_n(\vec r) = L_n(r_0, r_1, \ldots, r_m)$. Let

$$\mathcal{R}^n$$

denote the (non-random) set of all integer vectors $\vec r = (r_0, r_1, \ldots, r_m)$ satisfying the
condition (1.3), and let

$$\mathcal{R}^n_{\epsilon,p_1,p_2}$$

denote the (non-random) set of all integer vectors $\vec r = (r_0, r_1, \ldots, r_m)$ satisfying the
condition (1.3), but not (1.9).

To determine elements in $\mathcal{R}^n_{\epsilon,p_1,p_2}$ we need to pick $m$ elements from the set $\{1, 2, \ldots, n\}$.
Hence we get the following upper bound for the number of elements in the set $\mathcal{R}^n_{\epsilon,p_1,p_2}$:

$$|\mathcal{R}^n_{\epsilon,p_1,p_2}| \le \binom{n}{m} \le \left(\frac{en}{m}\right)^m = (ek)^m, \tag{2.1}$$

by a well known and simple bound on binomial coefficients.

Now, let $\delta$ be as in Theorem 1.1, that is, $0 < \delta < \min(\gamma^* - \gamma^*(p_1), \gamma^* - \gamma^*(p_2))$. By definition $LC_n$ is always larger than or equal
to $L_n(\vec r)$. For $\vec r$ to "define an optimal alignment" we need to have:

$$L_n(\vec r) \ge LC_n. \tag{2.2}$$

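Hence for the event $A^n_{\epsilon,p_1,p_2}$ not to hold, there needs to be at least one element $\vec r$ in $\mathcal{R}^n_{\epsilon,p_1,p_2}$
for which (2.2) is satisfied. This means that

$$A^{nc}_{\epsilon,p_1,p_2} = \bigcup_{\vec r \in \mathcal{R}^n_{\epsilon,p_1,p_2}} \{ L_n(\vec r) - LC_n \ge 0 \},$$

and thus

$$P(A^{nc}_{\epsilon,p_1,p_2}) \le \sum_{\vec r \in \mathcal{R}^n_{\epsilon,p_1,p_2}} P(L_n(\vec r) - LC_n \ge 0). \tag{2.3}$$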

When $\vec r \in \mathcal{R}^n_{\epsilon,p_1,p_2}$, the expectation of $L_n(\vec r) - LC_n$ is, for $n$ large enough, bounded above
as follows:

$$E[L_n(\vec r) - LC_n] \le -0.5\,\delta\epsilon mk.$$

(The proof of this fact is given in Lemma 2.1.) With the last inequality above, we find
that

$$P(L_n(\vec r) - LC_n \ge 0) \le P\left(L_n(\vec r) - LC_n - E[L_n(\vec r) - LC_n] \ge 0.5\,\delta\epsilon km\right). \tag{2.4}$$

Note that the quantity $L_n(\vec r) - LC_n$ changes by at most 2 units when we change any one of the
i.i.d. entries $X_1, X_2, \ldots, X_n; Y_1, Y_2, \ldots, Y_n$. Hence, we can apply Hoeffding's martingale
inequality to the right side of (2.4), to obtain

$$P(L_n(\vec r) - LC_n \ge 0) \le P\left(L_n(\vec r) - LC_n - E[L_n(\vec r) - LC_n] \ge 0.5\,\delta\epsilon n\right) \le \exp(-n\delta^2\epsilon^2/16).$$

(Recall that Hoeffding's inequality ensures that if $f$ is a map in $l$ entries, such that changing
any single entry affects the value by at most $a$, then

$$P\left(f(W_1, W_2, \ldots, W_l) - E[f(W_1, W_2, \ldots, W_l)] \ge \Delta l\right) \le \exp(-2l\Delta^2/a^2),$$

provided the variables $W_1, W_2, \ldots$ are independent. Here $l = 2n$, $a = 2$ and $\Delta = \delta\epsilon/4$.) Combining the last inequality above
with (2.3), one obtains

$$P(A^{nc}_{\epsilon,p_1,p_2}) \le |\mathcal{R}^n_{\epsilon,p_1,p_2}| \exp(-n\delta^2\epsilon^2/16). \tag{2.5}$$

But, by the equation (2.1), the set $\mathcal{R}^n_{\epsilon,p_1,p_2}$ contains no more than $(ek)^m$ elements, so that from
(2.5) we get

$$P(A^{nc}_{\epsilon,p_1,p_2}) \le (ek)^m \exp(-n\delta^2\epsilon^2/16) = \exp\left(-n\left(-\ln(ek)/k + \delta^2\epsilon^2/16\right)\right). \tag{2.6}$$
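
The arithmetic behind the last two steps can be checked numerically. The sketch below is only a sanity check under the reading used here, namely Hoeffding's bound $\exp(-2l\Delta^2/a^2)$ with $l = 2n$ entries, bounded difference $a = 2$ and deviation $\Delta l = 0.5\,\delta\epsilon n$, followed by a union bound over at most $(ek)^m$ vectors with $n = mk$. The function names and the sample parameter values are hypothetical.

```python
import math


def hoeffding_log_bound(num_entries, max_change, deviation_per_entry):
    """log of exp(-2 l Delta^2 / a^2) for l entries and bounded difference a."""
    return -2 * num_entries * deviation_per_entry ** 2 / max_change ** 2


def log_union_bound(n, k, eps, delta):
    """log of (e k)^m * exp(-n delta^2 eps^2 / 16), with m = n / k."""
    m = n // k
    return m * math.log(math.e * k) - n * delta ** 2 * eps ** 2 / 16


def log_closed_form(n, k, eps, delta):
    """log of exp(-n(-ln(e k)/k + delta^2 eps^2 / 16))."""
    return -n * (-math.log(math.e * k) / k + delta ** 2 * eps ** 2 / 16)


n, k, eps, delta = 10_000, 50, 0.3, 0.2   # illustrative values only
single = hoeffding_log_bound(2 * n, 2, delta * eps / 4)
print(single, -n * delta ** 2 * eps ** 2 / 16)
print(log_union_bound(n, k, eps, delta), log_closed_form(n, k, eps, delta))
```

Note that the exponent in the closed form is negative, so that the bound is useful, only when $\delta^2\epsilon^2/16$ exceeds $\ln(ek)/k$; for the illustrative values above it does not, which shows how large $k$ has to be taken in practice.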



This will finish the proof of Theorem 1.1 provided we prove:

Lemma 2.1 Let $\delta > 0$ be such that $\delta < \min(\gamma^*(1) - \gamma^*(p_1), \gamma^*(1) - \gamma^*(p_2))$. Assume
that $\vec r = (r_0, \ldots, r_m) \in \mathcal{R}^n_{\epsilon,p_1,p_2}$. Then for $n$ large enough, we have

$$E[L_n(\vec r) - LC_n] \le -0.5\,\delta\epsilon n. \tag{2.7}$$

Proof. Assume that we compute the LCS of the string $X_{(i-1)k+1} X_{(i-1)k+2} \ldots X_{ik}$ and of
the string $Y_{r_{i-1}+1} Y_{r_{i-1}+2} \ldots Y_{r_i}$. Let $\delta^* := \min(\gamma^*(1) - \gamma^*(p_1), \gamma^*(1) - \gamma^*(p_2))$. If $r_i - r_{i-1}$
is not between $p_1 k$ and $p_2 k$, then the rescaled expected value is below $\gamma^*$ by at least $\delta^*$
(rescaled by the average of the lengths of the two strings). Hence,

$$\gamma^* - \frac{E[|LCS(X_{(i-1)k+1} X_{(i-1)k+2} \ldots X_{ik};\; Y_{r_{i-1}+1} Y_{r_{i-1}+2} \ldots Y_{r_i})|]}{0.5\,(k + r_i - r_{i-1})} \ge \delta^*$$

if the length $r_i - r_{i-1}$ is not in $[kp_1, kp_2]$. One of the strings has length $k$. Hence, not
rescaled, we are below $\gamma^*$ times the average length by at least $\delta^* k/2$. By definition, any alignment belonging to
$\mathcal{R}^n_{\epsilon,p_1,p_2}$ has a proportion of at least $\epsilon$ of the intervals $X_{(i-1)k+1} X_{(i-1)k+2} \ldots X_{ik}$ which get
matched on strings of length not in $[p_1 k, p_2 k]$. This corresponds to a total number of at least $\epsilon m$ intervals.
For each of these intervals we are below the expected value, which would correspond to $\gamma^*$
times the average length, by at least $\delta^* k/2$. Hence, the expected value for any alignment
of $\mathcal{R}^n_{\epsilon,p_1,p_2}$ is below $\gamma^* n$ by at least $(\delta^* k/2)(\epsilon m) = n\delta^*\epsilon/2$. This means that

$$\gamma^* n - E[L_n(\vec r)] \ge n\delta^*\epsilon/2, \tag{2.8}$$

as soon as $\vec r = (r_0, r_1, \ldots, r_m)$ is in $\mathcal{R}^n_{\epsilon,p_1,p_2}$. Now $E[LC_n]/n \to \gamma^*$ as $n$ goes to infinity.
Note that, by definition, $\delta^* - \delta > 0$, and so for all $n$ large enough, we will have

$$\left|\gamma^* - \frac{E[LC_n]}{n}\right| \le (\delta^* - \delta)\epsilon/2. \tag{2.9}$$

Combining (2.8) and (2.9) yields that for all $n$ large enough, we have

$$E[LC_n] - E[L_n(\vec r)] \ge 0.5\,n\delta\epsilon,$$

as soon as $\vec r \in \mathcal{R}^n_{\epsilon,p_1,p_2}$. The proof is now completed. ■



3 Closeness to the diagonal 

Let us start by explaining how we can represent alignments in two dimensions by
considering an example. Take the two related words: the English $X$ = mother and the German
$Y$ = mutter. The longest common subsequence is mter and hence $LC_6 = 4$. As
mentioned, we represent any common subsequence as an alignment with gaps. The letters
appearing in the common subsequence are aligned one on top of the other. The letters
which are not aligned with the same letter in the other text get aligned with a gap. In
the present case the common subsequence mter corresponds to the following alignment:

X | m o - t h - e r
Y | m - u t - t e r      (3.1)

An alignment corresponding to a LCS is called an optimal alignment. The optimal
alignment is, in general, not unique. For example, to the same common subsequence mter
corresponds also the following optimal alignment:

X | m o - - t h e r
Y | m - u t t - e r      (3.2)



In the following, we represent alignments in 2 dimensions. For this we view alignments
as subsets of $\mathbb{R}^2$, in the following manner: If the $i$-th letter of $X$ gets aligned with
the $j$-th letter of $Y$, then the set representing the alignment is to contain the point $(i, j)$. For
example, the alignment (3.1) can be represented as follows: (1, 1), (3, 3), (5, 5), (6, 6) with
the corresponding plot

(3.3)



Here, the symbol x indicates pairs of aligned letters. We say that these points represent 
the optimal alignment. 
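
The two-dimensional representation is easy to compute from the dynamic programming table. The following sketch is an illustration only: `alignment_points` is a hypothetical helper that returns the points of one optimal alignment, chosen by an arbitrary tie-breaking rule, so for the mother/mutter example it may return either of the two optimal alignments discussed above.

```python
def alignment_points(x, y):
    """Points (i, j), 1-based, of one optimal alignment of x and y."""
    rows, cols = len(x) + 1, len(y) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Backtrack from the corner, recording every aligned letter pair.
    points, i, j = [], len(x), len(y)
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            points.append((i, j))
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return points[::-1]


def max_distance_from_diagonal(points):
    """Largest |i - j| over the aligned pairs."""
    return max(abs(i - j) for i, j in points)


points = alignment_points("mother", "mutter")
print(points, max_distance_from_diagonal(points))
```

The returned points are strictly increasing in both coordinates, which is the statement used later that the two-dimensional alignment curve cannot go down.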

The main result of the previous section implies that the optimal alignment must remain
close to the diagonal. This is the content of the next theorem. But first we need a
definition. Let

$$D^n_{\epsilon,p_1,p_2}$$

be the event that all points representing any optimal alignment of $X_1 X_2 \ldots X_n$ with
$Y_1 Y_2 \ldots Y_n$ are above the line $y = p_1 x - p_1 n\epsilon - p_1 k$ and below the line
$y = (1/p_1)x + (1/p_1)n\epsilon + (1/p_1)k$.

Theorem 3.1 Let $p_1 < 1 < p_2$ be such that $\gamma^*(p_1) < \gamma^*(1)$, $\gamma^*(p_2) < \gamma^*(1)$. Let $k \ge 1$,
let $\delta \in (0, \min\{\gamma^*(1) - \gamma^*(p_1), \gamma^*(1) - \gamma^*(p_2)\})$ be a constant which does not depend on $n$. Then, for any
$\epsilon > 0$ fixed, we have for $n$ large enough,

$$P(D^{nc}_{\epsilon,p_1,p_2}) \le 2\exp\left(-n\left(-\ln(ek)/k + \delta^2\epsilon^2/16\right)\right).$$

Proof. Let $D^n_a$ be the event that any optimal alignment of $X_1 X_2 \ldots X_n$ with $Y_1 Y_2 \ldots Y_n$ is
above the line $y(x) = p_1 x - p_1 n\epsilon - p_1 k$, and let $D^n_b$ be the event that any optimal alignment
of $X_1 X_2 \ldots X_n$ with $Y_1 Y_2 \ldots Y_n$ is below the line $y = (1/p_1)x + (1/p_1)n\epsilon + (1/p_1)k$. Note
that

$$D^n_{\epsilon,p_1,p_2} = D^n_a \cap D^n_b,$$

and thus

$$P(D^{nc}_{\epsilon,p_1,p_2}) \le P(D^{nc}_a) + P(D^{nc}_b).$$

By symmetry we have that $P(D^{nc}_a) = P(D^{nc}_b)$. The last inequality above then yields

$$P(D^{nc}_{\epsilon,p_1,p_2}) \le 2P(D^{nc}_a). \tag{3.4}$$

Next, we are going to prove that

$$A^n_{\epsilon,p_1,p_2} \subset D^n_a. \tag{3.5}$$

Let $x \in [0, n]$ be a multiple of $k$, and let $a$ be a natural number such that $ak = x$.
Let us first consider the case where $x \le \epsilon n$. In this situation, we have that $p_1 x - p_1\epsilon n$
is not positive. But any alignment (and optimal alignment) we consider maps any $x \in [0, n]$
into $[0, n]$. Hence for every $x \le \epsilon n$ the condition is always verified, that is, any optimal
alignment aligns $x$ onto a point which is no less than $p_1 x - p_1\epsilon n$.

Let us next consider the case where $x > \epsilon n$. When the event $A^n_{\epsilon,p_1,p_2}$ holds, then any
optimal alignment aligns all but a proportion $\epsilon$ of the intervals $[(i-1)k+1, ik]$,
$i \in \{1, \ldots, m\}$, onto intervals of length longer than or equal to $p_1 k$. The maximum number of
intervals which could be matched on intervals of length less than $p_1 k$ is thus $\epsilon m$. In the
interval $[0, x]$ there are $a$ intervals from the partition $[(i-1)k+1, ik]$, $i \in \{1, \ldots, m\}$.
Hence, at least $a - \epsilon m$ of these intervals are matched onto intervals of length no less
than $p_1 k$. This implies that, when the event $A^n_{\epsilon,p_1,p_2}$ holds, the point $x$ gets
matched by the optimal alignment on a value no less than

$$(a - \epsilon m)kp_1.$$

Noting that $ak = x$ and that $mk = n$, the above bound becomes

$$p_1 x - p_1\epsilon n.$$

If $x$ is not a multiple of $k$, let $x_1$ denote the largest multiple of $k$ which is less than $x$. By
definition, we have that

$$x - x_1 < k. \tag{3.6}$$

The two-dimensional alignment curve cannot go down; hence, $x$ gets aligned
to a point which cannot be below where $x_1$ gets aligned. Now, for $x_1$, since it is a
multiple of $k$, we have that it gets aligned on a point which is larger than or equal to

$$p_1 x_1 - p_1\epsilon n.$$

Using (3.6), we find

$$p_1 x_1 - p_1\epsilon n \ge p_1 x - p_1\epsilon n - p_1 k.$$

We have just proven that when the event $A^n_{\epsilon,p_1,p_2}$ holds, the point $x$ gets aligned above or
on the point $p_1 x - p_1\epsilon n - p_1 k$. This finishes proving that the event $A^n_{\epsilon,p_1,p_2}$ is a subevent
of $D^n_a$.

Since $A^n_{\epsilon,p_1,p_2} \subset D^n_a$, we get

$$P(D^{nc}_a) \le P(A^{nc}_{\epsilon,p_1,p_2}).$$

But by Theorem 1.1 the last probability above is upper bounded by:

$$\exp\left(-n\left(-\ln(ek)/k + \delta^2\epsilon^2/16\right)\right),$$

so that $P(D^{nc}_a) \le \exp\left(-n\left(-\ln(ek)/k + \delta^2\epsilon^2/16\right)\right)$. Hence, (3.4) becomes:

$$P(D^{nc}_{\epsilon,p_1,p_2}) \le 2\exp\left(-n\left(-\ln(ek)/k + \delta^2\epsilon^2/16\right)\right).$$

■

Theorem 3.1 allows one to reduce the time needed to compute the LCS of two random sequences.
First note that by rescaling the two-dimensional representation of optimal alignments by
$n$, it implies that, with high probability, up to a distance of order $\epsilon > 0$, any optimal
alignment is above the line $x \mapsto p_1 x$ and below $x \mapsto p_2 x$. Moreover, in the theorem we
can take $\epsilon > 0$ as small as we want (leaving it fixed though when $n$ goes to infinity).
Simulations seem to indicate that the mean curve $\gamma^*$ is strictly concave at $p = 1$. If this is
indeed true then we can take $p_1$ as close to one as we want and it will satisfy the conditions
of the theorem. That is, we could then take $\epsilon$ as close to 0 as we want and $p_1$ as close
as we want to 1. Hence, the rescaled two-dimensional representation of the optimal
alignments gets uniformly as close to the diagonal as we want when $n$ goes to infinity.
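
One possible way to exploit closeness to the diagonal, given here as a hedged sketch rather than as an algorithm from this paper, is to run the dynamic program only on cells $(i, j)$ with $|i - j|$ at most some band width. The function `banded_lcs_length` and its `band` parameter are hypothetical. The value it returns is always a lower bound for the LCS length, and it coincides with the LCS length whenever some optimal alignment stays inside the band; choosing a band width that is guaranteed to suffice is exactly what a result like Theorem 3.1 would have to supply.

```python
def lcs_length(x, y):
    """Unrestricted LCS length, used here only for comparison."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def banded_lcs_length(x, y, band):
    """LCS dynamic program restricted to cells (i, j) with |i - j| <= band."""
    n, m = len(x), len(y)
    if abs(n - m) > band:
        raise ValueError("band too narrow to reach the end of both strings")
    prev = {j: 0 for j in range(0, min(m, band) + 1)}      # row i = 0
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = 0
                continue
            # Diagonal step: cell (i-1, j-1) always lies inside the band.
            best = prev[j - 1] + (1 if x[i - 1] == y[j - 1] else 0)
            if j in prev:            # x_i aligned with a gap
                best = max(best, prev[j])
            if j - 1 in cur:         # y_j aligned with a gap
                best = max(best, cur[j - 1])
            cur[j] = best
        prev = cur
    return prev[m]


print(banded_lcs_length("mother", "mutter", 1), lcs_length("mother", "mutter"))
```

The restricted program visits on the order of $n \cdot \text{band}$ cells instead of $n^2$, which is where the potential saving in computing time would come from.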



Figure 1, below, is the graph of a simulation with two i.i.d. binary sequences of length
$n = 1000$. All the optimal alignments are contained between the two lines in the graph
below. We see that all the optimal alignments stay extremely close to the diagonal:

[Figure 1: n = 1000, uniform Bernoulli sequences, k = 2. Plot title: score LCS = 538, longest vertical distance = 26, longest horizontal length = 112. Horizontal axis: X-sequence of length 1000.]



4 Proving a property of the optimal alignment 

Let $\mathcal{P}$ be a map which assigns to every pair of strings $(x, y)$ a 1 if the pair $(x, y)$ has a
certain property and 0 otherwise. Hence, if $\mathcal{A}$ is the alphabet we consider, then

$$\mathcal{P} : \left(\cup_k \mathcal{A}^k\right) \times \left(\cup_k \mathcal{A}^k\right) \to \{0, 1\}.$$

If $\mathcal{P}(x, y) = 1$ we will say that the string pair $(x, y)$ has the property $\mathcal{P}$.
Let $\epsilon > 0$ be any fixed number and let $\vec r = (r_0, r_1, \ldots, r_m)$ be an
integer vector satisfying condition (1.3). Let $B^n_{\mathcal{P}}(\vec r, \epsilon)$ denote the event that there is a
proportion of at least $1 - \epsilon$ of the string pairs

$$(X_{(i-1)k+1} \ldots X_{ik};\; Y_{r_{i-1}+1} \ldots Y_{r_i}) \tag{4.1}$$

satisfying the property $\mathcal{P}$. In other words, the event

$$B^n_{\mathcal{P}}(\vec r, \epsilon)$$

holds if and only if

$$\sum_{i=1}^{m} \mathcal{P}(X_{(i-1)k+1} \ldots X_{ik};\; Y_{r_{i-1}+1} \ldots Y_{r_i}) \ge (1 - \epsilon)m.$$

Let $B^n_{\mathcal{P}}(\epsilon)$ denote the event that for every optimal alignment the proportion of aligned
string pairs (4.1) satisfying the property $\mathcal{P}$ is at least $1 - \epsilon$. Hence, the event $B^n_{\mathcal{P}}(\epsilon)$
holds if and only if for every vector $\vec r = (r_0, r_1, \ldots, r_m)$ satisfying (1.3) and such that
$LC_n = L_n(\vec r)$, the event $B^n_{\mathcal{P}}(\vec r, \epsilon)$ holds.

Most of the time, the properties we want for string pairs only hold with high probability
if the two strings have their lengths not too far from each other. Let $q$ be a (small)
constant with $q \in [0, 1]$. Assume that as soon as $r_i - r_{i-1} \in [kp_1, kp_2]$, the probability
that the string pair (4.1) has the required property is above $1 - q$. Hence, assume that
for every $r_1 \in [kp_1, kp_2]$ we have:

$$P(\mathcal{P}(X_1 \ldots X_k;\; Y_1 Y_2 \ldots Y_{r_1}) = 1) \ge 1 - q.$$

We are going to investigate next how small $q = q(k)$ needs to be, in order to ensure
that a large proportion of the aligned string pairs (4.1) have the property $\mathcal{P}$ (for every
optimal alignment). Recall that $A^n_{\epsilon,p_1,p_2}$ denotes the event that every optimal alignment
aligns a proportion larger than or equal to $1 - \epsilon$ of the substrings $X_{(i-1)k+1} \ldots X_{ik}$ to substrings
of $Y$ with length in $[p_1 k, p_2 k]$. Also, let $\mathcal{R}_{\epsilon,p_1,p_2}$ denote the set of integer vectors
$\vec r = (r_0, r_1, \ldots, r_m)$ satisfying (1.3) and such that at least $(1 - \epsilon)m$ of the
differences $r_i - r_{i-1}$ are in $[kp_1, kp_2]$.

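We will need a small modification of the event $B^n_{\mathcal{P}}(\vec r, \epsilon)$. For this let $\tilde B^n_{\mathcal{P}}(\vec r, \epsilon)$ denote
the event that among the aligned string pieces (4.1), there are no more than $m\epsilon$ which do
not satisfy the property $\mathcal{P}$ and have their length $r_i - r_{i-1}$ in $[kp_1, kp_2]$. We find that

$$A^n(\epsilon_1, p_1, p_2) \cap \left(\bigcap_{\vec r \in \mathcal{R}_{\epsilon_1,p_1,p_2}} \tilde B^n_{\mathcal{P}}(\vec r, \epsilon_2)\right) \subset B^n_{\mathcal{P}}(\epsilon_1 + \epsilon_2),$$

so that

$$P(B^{nc}_{\mathcal{P}}(\epsilon_1 + \epsilon_2)) \le P(A^{nc}(\epsilon_1, p_1, p_2)) + \sum_{\vec r \in \mathcal{R}_{\epsilon_1,p_1,p_2}} P(\tilde B^{nc}_{\mathcal{P}}(\vec r, \epsilon_2)). \tag{4.2}$$

We find the bound

$$P(\tilde B^{nc}_{\mathcal{P}}(\vec r, \epsilon_2)) \le \binom{m}{\epsilon_2 m} q^{\epsilon_2 m}.$$

Noting that $\binom{m}{\epsilon_2 m}$ is bounded above by $\exp(H_e(\epsilon_2)m)$, where $H_e$ is the base $e$ entropy
function, given by $H_e(x) = -x\ln x - (1-x)\ln(1-x)$, $0 < x < 1$, we find the bound

$$P(\tilde B^{nc}_{\mathcal{P}}(\vec r, \epsilon_2)) \le q^{\epsilon_2 m}\exp(H_e(\epsilon_2)m). \tag{4.3}$$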
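
The entropy bound on the binomial coefficient used above can be checked numerically. The sketch below is a sanity check only; it rounds $\epsilon_2 m$ down to an integer, which is an assumption made here because the text does not say how non-integer values are handled, and the helper names are hypothetical.

```python
import math


def entropy_nats(x):
    """Base-e entropy H_e(x) = -x ln x - (1 - x) ln(1 - x), with H_e(0) = H_e(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1 - x) * math.log(1 - x)


def log_binomial(m, j):
    """Natural logarithm of the binomial coefficient C(m, j)."""
    return math.lgamma(m + 1) - math.lgamma(j + 1) - math.lgamma(m - j + 1)


def entropy_bound_holds(m, eps):
    """Check C(m, floor(eps * m)) <= exp(H_e(eps) * m), on the log scale."""
    j = math.floor(eps * m)
    return log_binomial(m, j) <= entropy_nats(eps) * m + 1e-9


for m in (10, 100, 1000):
    for eps in (0.05, 0.1, 0.3, 0.5):
        print(m, eps, entropy_bound_holds(m, eps))
```

Since $H_e(x) < \ln 2$ for every $x < 0.5$, the factor $\exp((H_e(\epsilon_2) - \ln 2)m)$ that appears once $q$ is chosen small enough decays exponentially in $m$.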

We can now apply inequality (4.3) to inequality (4.2). For this note that in the set $\mathcal{R}_{\epsilon_1,p_1,p_2}$
there are less than $(3k)^m$ elements, as follows from the counting argument of Section 2. Thus we obtain

$$P(B^{nc}_{\mathcal{P}}(\epsilon_1 + \epsilon_2)) \le P(A^{nc}(\epsilon_1, p_1, p_2)) + (3k)^m q^{\epsilon_2 m}\exp(H_e(\epsilon_2)m). \tag{4.4}$$

Taking $q(k)$ equal to $1/(6k)^{1/\epsilon_2}$ yields

$$P(B^{nc}_{\mathcal{P}}(\epsilon_1 + \epsilon_2)) \le P(A^{nc}(\epsilon_1, p_1, p_2)) + \exp\left((H_e(\epsilon_2) - \ln 2)m\right). \tag{4.5}$$

Note that $H_e(\epsilon_2) < \ln 2$ as soon as $\epsilon_2 < 0.5$. So, if we assume that $\epsilon_2 < 0.5$, then the
expression $\exp((H_e(\epsilon_2) - \ln 2)m)$ is a (negatively) exponentially small quantity in $m$.
We already learned how to bound the probability of the event $A^{nc}(\epsilon_1, p_1, p_2)$ in the
previous sections. Hence, the inequality (4.5) allows us to show that a high percentage of
the aligned string pairs (4.1) have property $\mathcal{P}$ in any optimal alignment. For this we just
need to show that for pairs (4.1) with similar length, the probability $q(k)$ is less than or equal
to

$$\frac{1}{(6k)^{1/\epsilon_2}},$$

where

$$q(k) := \max_{r_1 \in [kp_1, kp_2]} P(\text{the pair } (X_1 \ldots X_k; Y_1 \ldots Y_{r_1}) \text{ does not have property } \mathcal{P}).$$

This is the content of the next theorem, which is obtained by putting $\epsilon_1 = \epsilon_2 = \epsilon/2$:

Theorem 4.1 Assume that $p_1$ and $p_2$ are such that $p_1 < 1 < p_2$. Let $\delta > 0$ be strictly
less than $\min(\gamma^* - \gamma^*(p_1), \gamma^* - \gamma^*(p_2))$. Let $\epsilon > 0$. Assume that there is a natural number
$k \ge 1$ such that

$$P((X_1 \ldots X_k, Y_1 \ldots Y_l) \text{ does not satisfy property } \mathcal{P}) \le \frac{1}{(6k)^{2/\epsilon}}$$

for any $l \in [kp_1, kp_2]$. Then, for any optimal alignment $\vec r$ (i.e., such that $LC_n = L_n(\vec r)$),
the proportion of string pairs $(X_{(i-1)k+1} \ldots X_{ik}; Y_{r_{i-1}+1} \ldots Y_{r_i})$ which have property $\mathcal{P}$ is
above $1 - \epsilon$ with probability bounded below as given in the next inequality:

$$P(B^n_{\mathcal{P}}(\epsilon)) \ge 1 - P(A^{nc}(0.5\epsilon, p_1, p_2)) - \exp\left((H_e(0.5\epsilon) - \ln 2)m\right),$$

and hence by Theorem 1.1,

$$P(B^n_{\mathcal{P}}(\epsilon)) \ge 1 - \exp\left(-n\left(-\ln(ek)/k + \delta^2\epsilon^2/64\right)\right) - \exp\left(n(H_e(0.5\epsilon) - \ln 2)/k\right),$$

for all $n$ large enough. Hence, the probability that there is not at least a proportion
$1 - \epsilon$ of string pairs (4.1) having property $\mathcal{P}$ in every optimal alignment is exponentially
small as $n$ goes to infinity (while holding $k$ and $\epsilon$ fixed) as soon as there exists $k \ge 1$ such
that:

$$k > \frac{64\ln(ek)}{\delta^2\epsilon^2} \tag{4.6}$$

and

$$\max_{l \in [p_1 k, p_2 k]} P((X_1 \ldots X_k, Y_1 \ldots Y_l) \text{ does not have property } \mathcal{P}) \le \frac{1}{(6k)^{2/\epsilon}}. \tag{4.7}$$

The above theorem is very useful for showing that when a certain property holds for
aligned string pairs with similar lengths of order $k$, then the property holds typically in
most parts of the optimal alignment. From our experience, for most properties we are
interested in, when $p_1$ and $p_2$ are close to 1 but fixed, then the probability that

$$(X_1 \ldots X_k, Y_1 Y_2 \ldots Y_l)$$

does not have a certain property is about the same for all $l \in [kp_1, kp_2]$. In other words,
how the alignment of $X_1 \ldots X_k$ with $Y_1 \ldots Y_l$ behaves does not depend very much on $l$ as
soon as $l$ is close to $k$ and $k$ is fixed. (We are not necessarily able to prove this formally in
many situations though.) Looking at (4.7), we see that we need a bound for the probability
on the left side of (4.7) which is smaller than any negative polynomial order in $k$. (At
least if we want to be able to take $\epsilon$ as close as we want to 0. If we just want $\epsilon > 0$ small
but fixed, then a negative polynomial bound with a huge exponent will do.) So, if the
probability is for example of order $k^{-\ln k}$ or $e^{-k^{\alpha}}$ for a constant $\alpha > 0$, we get condition
(4.7) satisfied by taking $k$ large enough. Similarly, condition (4.6) always gets satisfied
for $k$ large enough.

On the other hand, we could envision using Monte Carlo simulation in order to find a
likely bound for the probability on the left side of (4.7). Things then become much more
difficult. Assume that you want $\epsilon$ to be 0.2 and take $\delta = 0.1$. Then, by condition (4.6),
you find that $k$ must be larger than:

$$k > (\ln(64) + \ln(25) + \ln(100)) \cdot 64 \cdot 25 \cdot 100 \approx 1260000.$$

The probability not to have property $\mathcal{P}$ for strings of length approximately $k$ must be less
(see (4.7)) than $(6k)^{-10}$, so that with our previous bound on $k$, we would get less than

$$10^{-66}.$$

The above number is way too small for Monte Carlo simulations! Indeed, to show that a
probability is as small as $10^{-66}$, one would need to run an order of $10^{66}$ simulations.
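
The orders of magnitude in this discussion can be explored with a few lines of code. The sketch below assumes that condition (4.6) reads $k > 64\ln(ek)/(\delta^2\epsilon^2)$ and that the probability required by (4.7) is $(6k)^{-2/\epsilon}$; both readings, as well as the helper names, are assumptions of this illustration. The exact threshold found by the search need not coincide with the rough figure quoted above, but it is of comparable or larger order, so the conclusion about Monte Carlo simulation is unaffected.

```python
import math


def satisfies_4_6(k, eps, delta):
    """Assumed reading of (4.6): k > 64 ln(e k) / (delta^2 eps^2)."""
    return k > 64 * math.log(math.e * k) / (delta ** 2 * eps ** 2)


def smallest_k(eps, delta):
    """Smallest integer k meeting the assumed condition, by doubling then bisection."""
    hi = 2
    while not satisfies_4_6(hi, eps, delta):
        hi *= 2
    lo = hi // 2                      # the condition fails at lo
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if satisfies_4_6(mid, eps, delta):
            hi = mid
        else:
            lo = mid
    return hi


def log10_required_probability(k, eps):
    """log10 of (6 k)^(-2 / eps), the assumed bound required by (4.7)."""
    return -(2 / eps) * math.log10(6 * k)


k_min = smallest_k(0.2, 0.1)
print(k_min, log10_required_probability(k_min, 0.2))
```

The bisection is valid because the assumed condition, once true, stays true for all larger $k$.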



4.1 Further improvements 

There are several ways to improve on our bounds. First, we took as upper bound for
$|\mathcal{R}_{\epsilon,p_1,p_2}|$ the value $\binom{n}{m}$. One can find a better upper bound as follows: first note that if
$\vec r = (r_0, r_1, \ldots, r_m) \in \mathcal{R}_{\epsilon,p_1,p_2}$, then at least $(1 - \epsilon)m$ of the lengths $r_{i+1} - r_i$ are in the
interval $[p_1 k, p_2 k]$. To determine these lengths we have at most

$$((p_2 - p_1)k)^m \tag{4.8}$$

choices. Then, there can be as many as $\epsilon m$ of the lengths $r_{i+1} - r_i$ which are not in
$[p_1 k, p_2 k]$. Choosing those lengths is like choosing at most $\epsilon m$ points from a set of at most
$n$ elements. Hence, we get as upper bound $\binom{n}{\epsilon m}$ which, in turn, can be bounded by

$$\left(\frac{3k}{\epsilon}\right)^{\epsilon m}. \tag{4.9}$$

Finally, we have to decide which among the $m$ lengths $r_i - r_{i-1}$ have their length in
$[kp_1, kp_2]$ and which have not. That choice is bounded as follows:

$$\binom{m}{\epsilon m} \le \exp(H_e(\epsilon)m). \tag{4.10}$$

Combining the bounds (4.8), (4.9) and (4.10) yields

$$|\mathcal{R}_{\epsilon,p_1,p_2}| \le \left((p_2 - p_1)k\left(\frac{3k}{\epsilon}\right)^{\epsilon}\exp(H_e(\epsilon))\right)^m. \tag{4.11}$$

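With this improvement in the bound for the cardinality of $\mathcal{R}_{\epsilon,p_1,p_2}$, the inequality (4.4)
becomes:

$$P(B^{nc}_{\mathcal{P}}(\epsilon_1 + \epsilon_2)) \le P(A^{nc}(\epsilon_1, p_1, p_2)) + \left((p_2 - p_1)k\left(\frac{3k}{\epsilon_1}\right)^{\epsilon_1}\exp(H_e(\epsilon_1))\right)^m q^{\epsilon_2 m}\exp(H_e(\epsilon_2)m). \tag{4.12}$$

The last inequality above combined with Theorem 1.1 yields that

$$P(B^{nc}_{\mathcal{P}}(\epsilon_1 + \epsilon_2))$$

is less than

$$\exp\left(-n\left(-\ln(ek)/k + \delta^2\epsilon_1^2/16\right)\right) + \left((p_2 - p_1)k\left(\frac{3k}{\epsilon_1}\right)^{\epsilon_1}\exp(H_e(\epsilon_1))\,q^{\epsilon_2}\exp(H_e(\epsilon_2))\right)^m. \tag{4.13}$$

The last expression above is exponentially small in $n$ (we assume that we hold $k$ fixed)
if the following two conditions are satisfied:

(i)

$$k > \frac{16\ln(ek)}{\delta^2\epsilon_1^2}. \tag{4.14}$$

(ii) Assuming that $q(k)$ denotes the maximum for $l$ ranging over $[kp_1, kp_2]$ of the probability
that the property $\mathcal{P}$ does not hold for $(X_1 \ldots X_k; Y_1 \ldots Y_l)$, the second condition
is

$$q(k) < \frac{1}{\left((p_2 - p_1)k\left(\frac{3k}{\epsilon_1}\right)^{\epsilon_1}\exp(H_e(\epsilon_1) + H_e(\epsilon_2))\right)^{1/\epsilon_2}}. \tag{4.15}$$

Combining the last two conditions above yields:

$$q(k) < \left(\frac{\epsilon_1^2\delta^2}{(p_2 - p_1)\,16\ln(ek)\left(\frac{3k}{\epsilon_1}\right)^{\epsilon_1}\exp(H_e(\epsilon_1) + H_e(\epsilon_2))}\right)^{1/\epsilon_2}. \tag{4.16}$$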

The typical situation is that $\epsilon_1 + \epsilon_2$ should be of a given order. So, we will try to find $\epsilon_1$
and $\epsilon_2$ under the constraint $\epsilon = \epsilon_1 + \epsilon_2$, so that the right bound in (4.16) is least small.
For this note first that the power $1/\epsilon_2$ has much more effect on making the bound small
than the expression $\epsilon_1^2$ on top of the fraction bar.

Note that $\exp(H_e(\epsilon_1) + H_e(\epsilon_2))$ is just going to be a value between 1 and 2, and so will
not have a lot of influence. Also, $(3k/\epsilon_1)^{\epsilon_1}$ is somewhat negligible compared to $3k$. So
we are first going to disregard the quantities $(3k/\epsilon_1)^{\epsilon_1}$ and $\exp(H_e(\epsilon_1) + H_e(\epsilon_2))$. For
this let $g(k, \epsilon_1, \epsilon_2)$ be equal to

$$g(k, \epsilon_1, \epsilon_2) = \left(\frac{\epsilon_1^2\delta^2}{(p_2 - p_1)\,16\ln(ek)}\right)^{1/\epsilon_2}.$$

Note that $g(k, \epsilon_1, \epsilon_2)$ is larger than the bound on the right side of (4.16). This means that
if $g(k, \epsilon_1, \epsilon_2)$ is too small for allowing some Monte Carlo simulation, then the bound (4.16)
is also too small!

Note also that when we hold all the parameters $(p_1, p_2, \delta)$ fixed, the function $g(k, \epsilon_1, \epsilon_2)$
decreases when $\epsilon_2$ decreases, as well as when $\epsilon_1$ decreases. However, as already mentioned, $\epsilon_2$ "has more effect"
in decreasing $g(k, \epsilon_1, \epsilon_2)$ than $\epsilon_1$. Hence, given $\epsilon$ and given that all the other parameters are
fixed (including $k$), when we want to maximize $g(k, \epsilon_1, \epsilon_2)$ under the constraint $\epsilon_1 + \epsilon_2 = \epsilon$,
$\epsilon_1, \epsilon_2 > 0$, we get something where $\epsilon_2$ will be quite a bit larger than $\epsilon_1$.

Could Monte Carlo simulations be realistic with $\epsilon = 0.1$ and the bounds which
we have? The answer is no. For this note that $\delta/(p_2 - p_1)$ gets at first better when
we increase $|p_2 - p_1|$, since $\gamma^*$ has derivative 0 at $p = 1$. When the interval $[kp_1, kp_2]$
becomes too big, however, the property might no longer hold with high probability
for all pairs $(X_1 \ldots X_k, Y_1 \ldots Y_l)$ with $l \in [kp_1, kp_2]$. So, we will take $[p_1, p_2]$ as large
as possible around 1, so that the property still holds with high probability for all the string
pairs mentioned before. With such a choice we get that $\delta/(p_2 - p_1)$ can be treated as a
constant. Somewhat optimistically, say that the constant is less than 1/3. Now if $\epsilon = 0.1$,
then $\epsilon_1, \epsilon_2 < 0.1$. In that case,

$$g(k, \epsilon_1, \epsilon_2) \le g(k, 0.1, 0.1) \le \left(\frac{\delta}{4800\ln(ek)}\right)^{10}.$$

Returning to inequality (4.14) and taking $\delta = 0.2$, we find that $k$ must be larger than
$10^{10}$, so that $\ln(ek)$ is more than 21. With this in mind, we find that $g(k, \epsilon_1, \epsilon_2)$ in the
present case where $\epsilon_1 + \epsilon_2 = 0.1$ is less than $10^{-40}$, so no hope for Monte Carlo simulation here!

Monte Carlo with $\epsilon_1 = 0.1$ and $\epsilon_2 = 0.2$: Take also $\delta = 0.2$ and $\delta/(p_2 - p_1) = 1/2$.
With these values, using (4.14) we find that $k$ must be somewhat larger than
$10^2 \cdot 10 \cdot 8\ln(24000) \approx 10^5$. Then, also $q(k)$ by (4.15) should be less than

$$(10^2 \cdot 10 \cdot 11 \cdot 8)^{-5} \approx 10^{-25}.$$

This is still an order that is difficult for Monte Carlo simulation, but we get closer to
something which could be realistic. Now if we had $\epsilon_2 = 0.3$ instead, then we would get a bound
$10^{-15}$ which still does not look too good for Monte Carlo simulation!

When only dealing with the inequality (4.15), things look somewhat better. Take
$k = 1000$ and $(p_2 - p_1)k = 100$, then the order of the bound for $q(k)$ is about $10^{-5}$,
which is feasible with Monte Carlo! So, if we could find another method than the one
described here to make sure that most of the pieces of strings $X_{(i-1)k+1} X_{(i-1)k+2} \ldots X_{ik}$
are aligned with pieces of similar length, we would be in business!